Creating a Sankey Plot to Simplify Flow Processes

Intro

I've found that processes can sometimes get pretty complicated to explain. During those moments of quiet exasperation, there's only one thing to do... MAKE A SANKEY PLOT!

Seriously, though, Sankey plots can often help visualize something that would be much more difficult to explain in words. For example, let's say that I want to take a group of elementary school students on a field trip, and we have them vote on whether they want to go to a museum or an aquarium.

We can illustrate this using a Sankey diagram, which shows a flow process from one categorical variable to another. Here's a simple illustration of the example above, which I made using Draw.io.

[Figure: Sankey example diagram made in Draw.io]

As you can see, we started with 175 total students on the left. We divided those 175 students by grade, then by whether they were girls or boys, and finally by which location they voted for.

The plot I made by hand is good, but I'd really prefer to code it if possible. To do this, we're going to use the plotly package in Python.

Getting Started

Examining the documentation for Sankey diagrams at the Plotly website, we can see that we need four pieces of information to make a Sankey diagram.

  1. Node labels
  2. Source values
  3. Target values
  4. Numeric values

This might seem complicated, but don't fret! First, we need our node labels, which are just the text shown in the boxes above.

We'll read the information from the chart we made starting from the top and moving down, and from the left moving right, like we're reading columns in a newspaper.
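As a minimal sketch, the labels can be written out as a plain Python list in that reading order. The figure itself is not reproduced here, so the "5th Grade" label, the exact spelling of "Girls" and "Boys", and the order within each column are assumptions for illustration.

```python
# Node labels, read column by column: top to bottom, then left to right.
# ASSUMPTION: "5th Grade" and the order within each column are illustrative;
# the text only confirms "Students", "4th Grade", "Museum", and "Aquarium".
node_labels = [
    "Students",    # column 1
    "4th Grade",   # column 2
    "5th Grade",
    "Girls",       # column 3
    "Boys",
    "Museum",      # column 4
    "Aquarium",
]

print(node_labels)
```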

We also need to create a matching unique identifier for each of our node labels. To do this, we just need to enumerate over each entry in our node_labels variable.
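One way to do this, sketched below, is a dictionary comprehension over enumerate. The name node_ids is a hypothetical choice, and the label list repeats the assumed labels ("5th Grade" in particular is a guess at the second grade in the figure).

```python
# ASSUMPTION: illustrative labels; "5th Grade" is not confirmed by the text.
node_labels = ["Students", "4th Grade", "5th Grade",
               "Girls", "Boys", "Museum", "Aquarium"]

# enumerate yields (position, label) pairs, so each label maps to its
# position in the list, which serves as its unique integer identifier.
node_ids = {label: index for index, label in enumerate(node_labels)}

print(node_ids)
```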

Now we need our source values, target values, and numeric values.

Our source values and target values will use the exact same strings as we used in our node labels.

If you examine the arrows in our diagram above, the left end of each arrow marks its source. For example, the "Students" box has the left ends of two arrows attached to it. That means we need to list it as a source two times. We can go through the whole plot this way.

Notice that the source list does not include either "Museum" or "Aquarium", because neither of those boxes is the origin point for any arrow.
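A sketch of what the source list might look like, with one entry per arrow. It assumes each grade box sends one arrow to "Girls" and one to "Boys", and that each of those sends one arrow to each destination; the real figure may be laid out differently.

```python
# One entry per arrow, named after the box the arrow starts from.
# ASSUMPTION: the arrow layout and the "5th Grade" label are illustrative.
source = [
    "Students", "Students",      # Students splits into two grades
    "4th Grade", "4th Grade",    # each grade splits into girls and boys
    "5th Grade", "5th Grade",
    "Girls", "Girls",            # each group splits into the two votes
    "Boys", "Boys",
]

print(len(source), "arrows")
```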

Now we need to enter the target values. This is similar to our last task, except this time we're accounting for the end points of the arrows. Again, we're moving from top to bottom and from left to right.

In this case, the "Students" box has zero arrows that end on it, so we'll begin with the "4th Grade" box. The "4th Grade" box has one arrow ending on it, so we list it one time. Again, we go through the diagram accounting for each arrow endpoint.
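Under the same assumed arrow layout, the target list could be sketched as follows. The one thing to watch is ordering: the entry at each position in target has to describe the same arrow as the entry at that position in source, so the two lists are shown together here.

```python
# ASSUMPTION: the arrow layout and the "5th Grade" label are illustrative.
source = ["Students", "Students", "4th Grade", "4th Grade",
          "5th Grade", "5th Grade", "Girls", "Girls", "Boys", "Boys"]

# One entry per arrow, named after the box the arrow ends on.
target = ["4th Grade", "5th Grade", "Girls", "Boys",
          "Girls", "Boys", "Museum", "Aquarium", "Museum", "Aquarium"]

# Print each arrow to check that the pairs line up with the diagram.
for start, end in zip(source, target):
    print(f"{start} -> {end}")
```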

Now we just need to incorporate the values we have listed in the diagram for each flow process. Again, we go through the diagram from top to bottom, left to right.

If you get confused, you can always just look at the entries at the same position in our source and target lists. The first entry in our source list is "Students", and the first entry in our target list is "4th Grade", so the first entry in our values list will be the number of students in the 4th grade.
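Here is a sketch of a values list lined up with the source and target lists. Only the total of 175 students comes from the example; every other number below is a made-up placeholder, chosen so that the flow into each box equals the flow out of it.

```python
# ASSUMPTION: the arrow layout and the "5th Grade" label are illustrative.
source = ["Students", "Students", "4th Grade", "4th Grade",
          "5th Grade", "5th Grade", "Girls", "Girls", "Boys", "Boys"]
target = ["4th Grade", "5th Grade", "Girls", "Boys",
          "Girls", "Boys", "Museum", "Aquarium", "Museum", "Aquarium"]

# HYPOTHETICAL counts: only the 175 total is taken from the example.
values = [85, 90,    # Students -> 4th Grade, 5th Grade (85 + 90 = 175)
          45, 40,    # 4th Grade -> Girls, Boys
          44, 46,    # 5th Grade -> Girls, Boys
          50, 39,    # Girls -> Museum, Aquarium
          36, 50]    # Boys -> Museum, Aquarium

for start, end, count in zip(source, target, values):
    print(f"{start} -> {end}: {count}")
```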

Finally, we need to apply our enumerated unique identifiers to our sources and targets.
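This step can be sketched as a lookup from each label string to its integer identifier. The names node_ids, source_ids, and target_ids are hypothetical, and the labels and arrows are the same assumed ones as before.

```python
# ASSUMPTION: illustrative labels and arrows; "5th Grade" is a guess.
node_labels = ["Students", "4th Grade", "5th Grade",
               "Girls", "Boys", "Museum", "Aquarium"]
source = ["Students", "Students", "4th Grade", "4th Grade",
          "5th Grade", "5th Grade", "Girls", "Girls", "Boys", "Boys"]
target = ["4th Grade", "5th Grade", "Girls", "Boys",
          "Girls", "Boys", "Museum", "Aquarium", "Museum", "Aquarium"]

# Map each label to its position in node_labels.
node_ids = {label: index for index, label in enumerate(node_labels)}

# Swap every label string for its integer identifier.
source_ids = [node_ids[label] for label in source]
target_ids = [node_ids[label] for label in target]

print(source_ids)
print(target_ids)
```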

Now that we have all of the components to our Sankey diagram, we're ready to plot!
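Putting the pieces together, a minimal sketch of the plot with plotly's graph_objects module might look like this. The labels, arrows, and counts are the same illustrative placeholders (only the 175 total is from the example), and the function name, title, pad, and thickness settings are arbitrary choices.

```python
# ASSUMPTION: illustrative labels, arrows, and counts; only the
# 175-student total comes from the example.
node_labels = ["Students", "4th Grade", "5th Grade",
               "Girls", "Boys", "Museum", "Aquarium"]
source_ids = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
target_ids = [1, 2, 3, 4, 3, 4, 5, 6, 5, 6]
values = [85, 90, 45, 40, 44, 46, 50, 39, 36, 50]


def build_sankey_figure():
    """Build the Sankey figure from the lists above (hypothetical helper)."""
    # Imported here so the lists above can be checked without plotly.
    import plotly.graph_objects as go

    fig = go.Figure(data=[go.Sankey(
        # Nodes are the boxes; labels are matched to ids by position.
        node=dict(label=node_labels, pad=15, thickness=20),
        # Links are the arrows: source id, target id, and flow size.
        link=dict(source=source_ids, target=target_ids, value=values),
    )])
    fig.update_layout(title_text="Field Trip Votes", font_size=12)
    return fig
```

Calling build_sankey_figure().show() should open the interactive diagram in a browser or notebook.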